Introduction to R

Master 1 Economy & Psychology — Paris-Cité — 2025-2026

October 17, 2025

1 Lecture organisation

We only have three hours, but we’ll make the most of them!

  • This course introduces R, which is both
    • A programming language
    • A piece of software
  • We only have three hours. Therefore, you will not be manipulating data in this class. Instead, the course provides you with a toolbox
    • Researchers mostly follow the same steps in each data analysis project: import data, clean and prepare data, describe data, analyse data, plot data, export the plots and results
    • The slides present the functions that are essential for the first three steps
    • (Free) online resources are provided for the last three steps
  • My aim is for you to grasp the fundamentals of R so that you can reuse the functions I demonstrate and train yourself using free online resources

2 The (very) basics of R

From Denise Earle on X

Why use R?

  • R is an important tool for many data analysts who use it to prepare, analyse and plot data, as well as create websites, reports and presentations (like this one, which was made entirely in R). But most importantly, it
    • is free
    • is open-source
    • allows anyone to reproduce your results (see, for instance, the Institute For Replication)
    • is very powerful and flexible
    • has a very engaged community that you can ask for help

R’s logic

  • R is basically a big fancy calculator

Very few people use base R, most use an Integrated Development Environment (IDE) such as RStudio

Tip

R is designed for scripts. Write your code in scripts, so that you can: (1) track everything you have done; (2) reproduce everything you have done; (3) share everything you have done.

  • Pro tip: the first time you use RStudio, go to Tools –> Global Options… and
    1. Untick the box Restore .RData into workspace at startup
    2. Set Save workspace to .RData on exit to Never
    • I really advise you to do this so that each time you open a new R session, you start with a brand new, empty environment. Imagine still having loaded data named expe1 and not understanding why the right columns are not showing? Been there, done that

R is an object-oriented language

  • R is essentially a big, fancy calculator
    • Do you want to find \(\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \; \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\big)^2\) ? It can do that for you — if you know how to ask!
  • R works with objects. Objects can be (almost) anything
    • A single number or multiple numbers
    • A single character or character string (a word, multiple words, a sentence, multiple sentences, etc.)
    • A data frame comprising numbers and character strings
    • A plot
    • The output of a Generalised Random Forest
    • And, most importantly, all of the above, nested like Russian dolls 🪆
  • Combine a calculator with objects and you get the essence of R: a large calculator that manipulates objects
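
A minimal sketch of this nesting (all object names are invented for the example):

```r
# A number, a string, and a vector, each stored as its own object
n <- 42
s <- "penguin"
v <- c(1, 2, 3)

# Objects can be nested inside other objects, like Russian dolls
doll <- list(number = n, word = s, values = v)
doll$values[2]   # Access the second value stored inside the nested vector
[1] 2
```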

R is basically a big (fancy) calculator

# Addition
1 + 1 
[1] 2
# Subtraction
19 - 8
[1] 11
# Division
log(x = 98) / sqrt(x = 108)
[1] 0.4411887
# Multiplication
78^83 * (1 / log(x = sqrt(x = 32^3), base = exp(1)))
[1] 2.12797e+156

Tip

Anything that comes after a # in an R script is a comment that will never be run by R.

Comments are very useful for remembering why and how you wrote your script and for explaining it to yourself and others (think reproducibility).

R is basically a big (fancy) calculator that manipulates objects

adding_1 <- 1 + 1
subtract_1 <- 19 - 8
divide_1 <- log(x = 98, base = exp(1)) / sqrt(x = 108)
multiply_1 <- 78^83 * (1 / log(x = sqrt(x = 32^3), base = exp(1)))

adding_1 + subtract_1 / (divide_1 * multiply_1)
[1] 2

Notes

  1. The <- operator means that I am assigning what is on the right of the operator to the object that is on the left of the operator.

  2. When defining and manipulating objects, R (1) stores them in the Environment window; and (2) prints the used code and the answer in the Console window.

  3. Always use lowercase letters when naming your objects (R is case-sensitive) and avoid using diacritics or special characters. A good, albeit slightly long, name is lm_income_by_age_robustness_trimmed_1_99. This is a VERY BAD name: Predictor_âge_Sexe_robustness1_sansélèvesdéfaillants. Technically, it works. However, you’ll find yourself having to try very hard to write it correctly every time you need to.
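
Case sensitivity in action (the object names are invented for the example):

```r
# R is case-sensitive: these are two distinct objects
result <- 10
Result <- 20
result + Result   # 30; typing RESULT would throw an "object not found" error
exists("RESULT")  # FALSE: no object with that exact spelling
```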

Functions are a central element of R that build on objects

  • Functions are objects that
    • Take an input with arguments
    • Follow a series of instructions given arguments
    • Provide an output given arguments
  • For example, imagine that you want to take the square root of all the values in an object
x <- c(10, 4, 7, 12, 3, 18, 4, 12, 8) # Manually define an object x that takes multiple values

sqrt(x = x)
[1] 3.162278 2.000000 2.645751 3.464102 1.732051 4.242641 2.000000 3.464102
[9] 2.828427

Note

In this case, the function only takes one argument, x. Usually, a function takes multiple arguments (x, y, dataframe, etc.).

Functions are a central element of R that build on objects

  • Functions are objects that
    • Take an input with arguments
    • Follow a series of instructions given arguments
    • Provide an output given arguments
  • For example, imagine you want to find the lowest value in a given column, add its line number, and then multiply the result by the time of day. You could write a function to perform this task
x <- c(10, 4, 7, 12, 3, 18, 4, 12, 8) # Manually define an object x that takes multiple values

min_x <- function(x) {                          # Take the object x as an input
  i <- which.min(x)                             # Find the smallest value of x
  val <- x[i] + i                               # Add the line number to the smallest value
  val * as.numeric(format(Sys.time(), "%H"))    # Multiply the result by the hour of the day (0–23)
}

min_x(x = x)
[1] 104

Note

In this case, the function only takes one argument, x. Usually, a function takes multiple arguments (x, y, dataframe, etc.).
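
For instance, a sketch of a function with several arguments, two of which have default values (trim_mean is an invented name, not a base R function):

```r
# A function with one required argument (x) and two arguments with defaults
trim_mean <- function(x, trim = 0, na.rm = TRUE) {
  mean(x, trim = trim, na.rm = na.rm)
}

trim_mean(x = c(1, 2, 3, 100))               # defaults: plain mean, 26.5
trim_mean(x = c(1, 2, 3, 100), trim = 0.25)  # trim 25% from each side: 2.5
```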

Packages or libraries contain functions that perform calculations by manipulating objects

  • Packages are collections of functions, dataframes and documentation that come together
    • Some packages are built into R and are part of what is called base R
    • Some packages need to be downloaded. These are known as contributed packages and are extremely useful
  • Very few people work in base R. Those who do tend to do so because they
    • Need the full flexibility of R. For instance, Offer-Westort, Rosenzweig, and Athey (2024) wrote an adaptive assignment algorithm based on balanced linear Thompson sampling in base R, and their code can be accessed online
    • Need full control and reproducibility over their analyses. National statistics organisations such as INSEE mostly use base R because they cannot afford their statistics to be altered if an update to a package changes its calculation method
install.packages("tidyverse") # Install a package
library("tidyverse")          # Load    a package

Note

Simply installing a package is not enough: you also need to load it each time you launch a new session.

Find out how to get the most out of packages

  • Packages can be very powerful. Make the most of them!
    • At a minimum, packages come with a GitHub page; a quick Google search will find it. Unfortunately, GitHub pages contain varying amounts of information about the functions and their arguments
    • Here, the dplyr package

Find out how to get the most out of packages

  • Most package pages follow the same structure
    • Get started provides an introduction to the package, offering a summary of its key objectives and functions alongside relevant R code
    • Reference gathers all functions in one place
    • Articles shows how to perform the package’s main tasks, with accompanying R code
    • News is the changelog file that tracks all changes for every new release of the package

Packages or libraries contain functions that perform calculations by manipulating objects

By Albert Y. Kim on X

3 Load a dataframe

You are (almost) ready to load your first dataframe: a Comma-Separated Values (CSV) file

  • Most of the data you’ll use will come in one of a few formats
    • CSV or TXT. The formats psychologists use most: light, easy to share, and easy to import/export
    • XLSX. Excel’s format. Also easy to share and export, although somewhat heavier. It has no advantage over CSV for sharing a dataframe
    • SAV, SAS, or DTA. Respectively SPSS’s, SAS’s, and Stata’s formats. You’ll learn how to use them when you need to, using the haven package
    • PARQUET. A file format specifically designed for (very) large datasets (more information here). Again, you’ll learn how to use it when you need to, using the nanoparquet package
  • Why almost ready? Because we have not yet talked about the working directory
    • R needs to be told everything, including where to load the data from. My best advice is to work with R Projects
    • R Projects are .Rproj files that make your life much easier. Essentially, they tell R what your working directory is, so that you can work with a folder (and subfolders) for each of your data analysis projects

Work with R Projects

  • To define a new R Project, go to File –> New Project… and follow the steps

  • An alternative is to define the working directory manually. I am not fond of this approach, as it can be cumbersome and breaks with even minor changes to the file path
setwd("C:/Users/jaime/OneDrive/Documents/Emplois/Enseignements/Paris-Cité/2025 - Introduction à R/Support")

You are ready to load your first dataframe: a Comma-Separated Values (CSV) file

  • I have downloaded a dataframe and saved it in the folder Data that is in my working directory
df <- read.table(file = "Data/penguins.csv", 
                 header = TRUE, 
                 sep = ",", 
                 row.names = "X")

Note

Errors may occur when loading data: CSV is not a single standard. Some variants use commas , as column separators while others use semicolons ;, and some use periods . as decimal points while others use commas ,. Sometimes you need to dig in manually: the column separator may be a tab, or the file may contain row names that you did not specify (are you using a small x instead of a capital X?), and so on. It can be a bit cumbersome — we’ve all been there!
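
As a concrete illustration of the separator issue, base R ships read.csv2() for the semicolon/comma convention. A minimal sketch (the temporary file and column names are invented for the example):

```r
# European-style CSV files use ';' as separator and ',' as decimal point.
# Write a tiny example file, then read it back with read.csv2()
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;weight", "1;3,75", "2;4,20"), tmp)

df_eu <- read.csv2(file = tmp)  # sep = ";" and dec = "," by default
df_eu$weight                    # parsed as numeric: 3.75 4.20
```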

Tip

You can make your life much easier by using File –> Import Dataset 😉

Before we go any further, let’s take a closer look at the read.table() function

df <- read.table(file = "Data/penguins.csv", 
                 header = TRUE, 
                 sep = ",", 
                 row.names = "X")
  • A function takes arguments
    • Learn more about a function by typing ?read.table in the console
    • The documentation displays the default values (e.g., header = FALSE). Arguments listed without a default indicate that a value is required
    • The file = part is not necessary, as long as you respect the order of the arguments
    • "" are necessary to indicate a string. Otherwise, R will consider that Data/penguins.csv is an object
    • Arguments expecting logical values can take either TRUE or T, and FALSE or F
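
Argument matching can be sketched with a simple base R function:

```r
# Named arguments can be given in any order; unnamed ones match by position
round(x = 3.14159, digits = 2)   # fully named
round(digits = 2, x = 3.14159)   # same result, order swapped
round(3.14159, 2)                # positional matching
[1] 3.14
```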

Check the importation of your first dataframe

  • colnames() returns the name of the columns (which are vectors)
colnames(x = df)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Check the importation of your first dataframe

  • nrow() and ncol() respectively return the number of rows and of columns
nrow(x = df)
[1] 344
ncol(x = df)
[1] 8

Check the importation of your first dataframe

  • head() returns the first rows of an object
head(x = df)
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA        <NA>
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
6   male 2007

Check the importation of your first dataframe

  • str() returns the structure of the object
str(object = df)
'data.frame':   344 obs. of  8 variables:
 $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : chr  "3750" "3800" "3250" NA ...
 $ sex              : chr  "male" "female" "female" NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
  • chr, num and int indicate the type of data (vectors) that the dataframe contains

Note

Can you spot the error(s)?

Check the importation of your first dataframe

  • summary() returns a summary of the object
summary(object = df)
   species             island          bill_length_mm  bill_depth_mm  
 Length:344         Length:344         Min.   :32.10   Min.   :13.10  
 Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60  
 Mode  :character   Mode  :character   Median :44.45   Median :17.30  
                                       Mean   :43.92   Mean   :17.15  
                                       3rd Qu.:48.50   3rd Qu.:18.70  
                                       Max.   :59.60   Max.   :21.50  
                                       NA's   :2       NA's   :2      
 flipper_length_mm body_mass_g            sex                 year     
 Min.   :172.0     Length:344         Length:344         Min.   :2007  
 1st Qu.:190.0     Class :character   Class :character   1st Qu.:2007  
 Median :197.0     Mode  :character   Mode  :character   Median :2008  
 Mean   :200.9                                           Mean   :2008  
 3rd Qu.:213.0                                           3rd Qu.:2009  
 Max.   :231.0                                           Max.   :2009  
 NA's   :2                                                             

Check the importation of your first dataframe

  • describe(), well… describes the object
library(psych)
describe(x = df)
                  vars   n    mean    sd  median trimmed   mad    min    max
species*             1 344    1.92  0.89    2.00    1.90  1.48    1.0    3.0
island*              2 344    1.66  0.73    2.00    1.58  1.48    1.0    3.0
bill_length_mm       3 342   43.92  5.46   44.45   43.91  7.04   32.1   59.6
bill_depth_mm        4 342   17.15  1.97   17.30   17.17  2.22   13.1   21.5
flipper_length_mm    5 342  200.92 14.06  197.00  200.34 16.31  172.0  231.0
body_mass_g*         6 342   44.87 24.67   42.00   44.19 29.65    1.0   94.0
sex*                 7 333    1.50  0.50    2.00    1.51  0.00    1.0    2.0
year                 8 344 2008.03  0.82 2008.00 2008.04  1.48 2007.0 2009.0
                  range  skew kurtosis   se
species*            2.0  0.16    -1.73 0.05
island*             2.0  0.61    -0.91 0.04
bill_length_mm     27.5  0.05    -0.89 0.30
bill_depth_mm       8.4 -0.14    -0.92 0.11
flipper_length_mm  59.0  0.34    -1.00 0.76
body_mass_g*       93.0  0.24    -1.06 1.33
sex*                1.0 -0.02    -2.01 0.03
year                2.0 -0.05    -1.51 0.04

Note

describe() treats non-numerical data as numerical and flags the converted columns with *. Be careful!

Check the importation of your first dataframe

  • datasummary_skim() describes the object, with a visual aid for numerical data
library(modelsummary)
datasummary_skim(data = df, type = "numeric")
                   Unique  Missing Pct.    Mean     SD     Min   Median     Max
bill_length_mm        165       1          43.9    5.5    32.1     44.5    59.6
bill_depth_mm          81       1          17.2    2.0    13.1     17.3    21.5
flipper_length_mm      56       1         200.9   14.1   172.0    197.0   231.0
year                    3       0        2008.0    0.8  2007.0   2008.0  2009.0
(The table also includes a Histogram column of inline plots, not reproduced here.)

Check the importation of your first dataframe

  • datasummary_skim() describes the object
datasummary_skim(data = df, type = "categorical")
                        N      %
species   Adelie      152   44.2
          Chinstrap    68   19.8
          Gentoo      124   36.0
island    Biscoe      168   48.8
          Dream       124   36.0
          Torgersen    52   15.1
sex       female      165   48.0
          male        168   48.8
          NA           11    3.2

Caution

Warning: These variables were omitted because they include more than 50 levels: body_mass_g.

Notes

If you cannot remember to which package a function belongs, for instance datasummary_skim(), you can type ??datasummary_skim() in the console. This will take a while, as R will search through every package you have ever installed for the function.

Check the importation of your first dataframe: The types of data

  • There are six different types of vector, each relating to a different kind of data
    • Numeric. Numbers with a decimal point (doubles)
    • Integer. Whole numbers (no decimal)
    • Logical. TRUE or FALSE data. Can also take the value NA, representing missing values
    • Complex. Complex numbers. We won’t cover it, my brain already hurts (\(i^2 = -1 \text{ ?!}\))
    • Character. String values, whether letters, words, sentences… Categorical characters can be stored as factors, which may be ordered (for instance, Never < Sometimes < Always is an ordered factor)
    • Raw. We won’t cover it, as it is not used 99.99% of the time
  • These types of data are stored in vectors, which are R’s fundamental data structure. When it comes to data, in R, everything is either a vector or an aggregate of vectors
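
A quick sketch of these types (the objects are invented for the example; class() reports how R stores each one):

```r
# One vector of each common type
num <- c(1.5, 2.7)                 # numeric (double)
int <- c(1L, 2L)                   # integer (the L suffix forces whole numbers)
lgl <- c(TRUE, FALSE, NA)          # logical, with a missing value
chr <- c("Never", "Sometimes", "Always")
fct <- factor(chr, levels = c("Never", "Sometimes", "Always"), ordered = TRUE)

class(num); class(int); class(lgl); class(chr); class(fct)
```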

Check the importation of your first dataframe: Modify the type of data

  • R is usually able to assign the correct data type to a vector. However, as you saw, it failed for the body_mass_g column, treating it as a character rather than a numeric value
    • Question: Why is this a problem?
    • Let’s reassign it. Remember when I said there are base R and contributed packages? I’ll show you two ways to do the same operation
# Base R
df$body_mass_g <- as.integer(x = df$body_mass_g)

str(object = df$body_mass_g)
 int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
# tidyverse
df <- df %>%
  mutate(body_mass_g = as.integer(body_mass_g))
str(object = df$body_mass_g)
 int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...

Notes

  1. Use the $ sign to access vectors stored in objects, such as a dataframe. This method is not 100% reliable, depending on the structure of the object, but you will learn as you go along.

  2. Most of the time, you don’t need to work in base R to manipulate data. You will learn to use the tidyverse packages (Wickham et al. 2019), which will make your data manipulation much easier.

summary(object = df$body_mass_g)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2700    3550    4050    4202    4750    6300       2 

4 How to manipulate data: The tidyverse and the pipe operator %>%

The tidyverse is a suite of packages intended to make data manipulation much easier

  • Again, everything can be done in base R, but some manipulations are highly cumbersome there. To ease data wrangling, a suite of packages (and functions) was created under the name tidyverse. It is built around tidy data, where
    1. Each variable forms a column
    2. Each observation forms a row (not a person: an observation)
    3. Each type of observational unit forms a table
install.packages("tidyverse") # Install a package
library("tidyverse")          # Load    a package
  • The tidyverse
    • Takes in tidy data
    • Uses well-named functions that reflect their purpose: for example, rename() renames columns and filter() keeps the rows that meet given conditions
    • Shares a consistent syntax

Before manipulating data: The pipe operator %>%

  • R evaluates nested expressions from the inside out, which can quickly become unreadable. To avoid this, a pipe operator was created
  • Base R
mean(x = 
  sqrt(x = 
    abs(x = 
      log(x = 
        as.numeric(x = 
          factor(x = c(-10, 0, 10, 100), 
                 levels = c(-10, 0, 10, 100))
        )
      )
    )
  )
)
[1] 0.7645279
  • Using the pipe operator
c(-10, 0, 10, 100) %>%                         # Create a vector
  factor(levels = c(-10, 0, 10, 100)) %>%      # convert to factor
  as.numeric() %>%                             # factor → numeric
  log() %>%                                    # take log
  abs() %>%                                    # absolute value
  sqrt() %>%                                   # square root
  mean()                                       # mean of the result
[1] 0.7645279
  • The %>% operator takes the object on its left side and makes it the first argument of the function on its right side. Put differently, the code now reads linearly!
  • Another pipe operator |> also exists. You can find a summary of the main differences here. Personally, I use the %>% operator out of habit, and also because it is easier to use with an AZERTY keyboard
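
A minimal sketch of the same idea with the native |> operator (available since R 4.1):

```r
# The native pipe |> behaves the same way for simple first-argument piping
c(4, 16, 25) |>
  sqrt() |>   # 2, 4, 5
  mean()      # 11/3
```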

Rename vectors (columns in a dataframe)

  • Let’s translate a column to French 🥖
# Base R
names(df)[names(df) == "species"] <- "Espèce"
# tidyverse
df <- df %>% 
  rename("Espèce" = "species")

A tidyverse workflow, taken from Wickham, Cetinkaya-Rundel, and Grolemund (2023).

Create a new vector (column in a dataframe), based on an existing one

  • For example, what if I prefer to count in pounds rather than grams?
# Base R
df$body_mass_pounds <- df$body_mass_g / 453.59237
# tidyverse
df <- df %>% 
  mutate(body_mass_pounds = body_mass_g / 453.59237)
  • Get the mean and the standard deviation of this new column
mean(x = df$body_mass_pounds)
[1] NA
sd(x = df$body_mass_pounds)
[1] NA
  • R needs to be told what to do with missing values (NAs)
mean(x = df$body_mass_pounds, na.rm = TRUE)
[1] 9.263283
sd(x = df$body_mass_pounds, na.rm = TRUE)
[1] 1.768007
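
The NA behaviour can be sketched on a toy vector:

```r
# Check how many values are missing before summarising
x <- c(10, NA, 30, NA, 50)
sum(is.na(x))           # 2 missing values
mean(x)                 # NA: missingness propagates by default
mean(x, na.rm = TRUE)   # 30: NAs removed before computing
```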

Filter a dataframe, according to a condition in a vector (column)

df %>%
  filter(body_mass_g >= mean(body_mass_g)) %>%
  nrow()
[1] 0
df %>%
  filter(body_mass_g >= mean(body_mass_g, 
                             na.rm = TRUE)) %>%
  nrow()
[1] 149

Pivot a dataframe: 1/2

  • We sometimes need to pivot data in order to run a repeated measures analysis

 id   group     before     after
  1   control   40.48391   42.24692
  2   control   52.06518   54.77531
  3   control   53.46311   40.54112
  4   control   52.94401   53.91534
  5   control   53.86674   51.82061
  6   test      45.80264   52.06510
  7   test      54.11571   57.03168
  8   test      49.76629   66.47595
  9   test      56.28729   56.44203
 10   test      53.39197   64.62119

Pivot a dataframe: 2/2

  • We sometimes need to pivot data in order to run a repeated measures analysis
df_wide %>%
  pivot_longer(
    cols = c(before,
             after),
    names_to = "time",
    values_to = "value")

Note

A pivot_wider() function is also available, depending on the required pivoting.

 id   group     time     value
  1   control   before   40.48391
  1   control   after    42.24692
  2   control   before   52.06518
  2   control   after    54.77531
  3   control   before   53.46311
  3   control   after    40.54112
  4   control   before   52.94401
  4   control   after    53.91534
  5   control   before   53.86674
  5   control   after    51.82061
  6   test      before   45.80264
  6   test      after    52.06510
  7   test      before   54.11571
  7   test      after    57.03168
  8   test      before   49.76629
  8   test      after    66.47595
  9   test      before   56.28729
  9   test      after    56.44203
 10   test      before   53.39197
 10   test      after    64.62119
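
Going the other way, pivot_wider() spreads the time column back into one column per time point. A minimal sketch on invented toy data:

```r
library(tidyr)

# Toy long-format data (illustrative values)
df_long <- data.frame(
  id    = c(1, 1, 2, 2),
  group = c("control", "control", "test", "test"),
  time  = c("before", "after", "before", "after"),
  value = c(40.5, 42.2, 45.8, 52.1)
)

# Reverse of pivot_longer(): one column per level of `time`
df_back <- pivot_wider(df_long, names_from = time, values_from = value)
df_back   # columns: id, group, before, after
```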

5 Summarise data

The base R functions table() and prop.table()

  • R has a built-in table() function. It won’t get you very far, but it is sometimes enough to quickly skim through your data (check how many categories a column has; whether some categories need to be merged together; the number of observations per category…)
df %>%
  select(sex, species) %>%
  table()
        species
sex      Adelie Chinstrap Gentoo
  female     73        34     58
  male       73        34     61
  • You can get the percentages by adding the prop.table() function
df %>%
  select(sex, species) %>%
  table() %>%
  prop.table() * 100
        species
sex        Adelie Chinstrap   Gentoo
  female 21.92192  10.21021 17.41742
  male   21.92192  10.21021 18.31832

Careful!

We did not check for missing data thoroughly enough. Reporting these tables without further exploring whether (and how much) data is missing would be a bold move.
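
One quick way to explore missingness before reporting such tables, sketched on invented toy data:

```r
# Count the missing values per column before building any cross-table
toy <- data.frame(
  sex     = c("male", "female", NA, "male"),
  species = c("Adelie", "Adelie", "Gentoo", NA)
)

colSums(is.na(toy))                            # NAs per column: 1 and 1
table(toy$sex, toy$species, useNA = "ifany")   # keep NAs visible in the table
```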

Summarise data, the tough way

df %>%

  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex)
     sex     mean       sd   n       se confint_low confint_high
1   male 4545.685 787.6289 168 60.76689    4426.581     4664.788
2 female 3862.273 666.1720 165 51.86142    3760.624     3963.921
3   <NA>       NA       NA  11       NA          NA           NA
df %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex)
     sex     mean       sd   n       se confint_low confint_high
1   male 4545.685 787.6289 168 60.76689    4426.581     4664.788
2 female 3862.273 666.1720 165 51.86142    3760.624     3963.921
  • In this case, applying the drop_na() function to the entire dataset does not alter the results. However, most of the time it will. It would be better to do the following. Why?
df %>%
  select(body_mass_g, sex) %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex)
     sex     mean       sd   n       se confint_low confint_high
1   male 4545.685 787.6289 168 60.76689    4426.581     4664.788
2 female 3862.273 666.1720 165 51.86142    3760.624     3963.921

Summarise data, the tough way, but make it pretty(ish) using the flextable package 1/2

Sex      Mean body mass (g)   sd       Number of penguins   se      95% confidence interval [lower]   95% confidence interval [upper]
Male     4,545.68             787.63   168                  60.77   4,426.58                          4,664.79
Female   3,862.27             666.17   165                  51.86   3,760.62                          3,963.92

Summarise data, the tough way, but make it pretty(ish) using the flextable package 2/2

# Install the package and load it
install.packages("flextable")
library("flextable")

# Construct the table
df %>%
  select(body_mass_g, sex) %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex) %>%
  
  rename("Sex" = "sex",
         "Mean body mass (g)" = "mean",
         "Number of penguins" = "n",
         "95% confidence interval [lower]" = "confint_low",
         "95% confidence interval [upper]" = "confint_high") %>%
  
  mutate(across(where(is.numeric), ~ round(.x, 2)),
         Sex = ifelse(test = Sex == "male", yes = "Male", no = "Female")) %>%

# Format the table
  flextable() %>%
  bold(part = "header") %>%
  autofit() # %>%
# save_as_docx(path = "/Tables/Penguins_table_1.docx") # Or save_as_html()

Summarise data, the tough way, and export it to LaTeX using the knitr and kableExtra packages

library("knitr")
library("kableExtra")

df %>%
  select(body_mass_g, sex) %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex) %>%
  rename("Sex" = "sex",
         "Mean body mass (g)" = "mean",
         "Number of penguins" = "n",
         "95% confidence interval [lower]" = "confint_low",
         "95% confidence interval [upper]" = "confint_high") %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)),
         Sex = ifelse(Sex == "male", "Male", "Female")) %>%
  
  kable(format = "latex", booktabs = TRUE, align = "c") %>%
  kable_styling(latex_options = c("hold_position", "striped")) %>%
  save_kable("Tables/Penguins_table_1.tex")

Summarise data the easier way, using the gtsummary package

library(gtsummary)
theme_gtsummary_compact()
df %>%
  select(sex, species) %>%
  
  tbl_summary(by = "species", 
              percent = "cell", 
              statistic = all_categorical() ~ "{p}%")
Characteristic   Adelie, N = 152¹   Chinstrap, N = 68¹   Gentoo, N = 124¹
sex
    female       22%                10%                  17%
    male         22%                10%                  18%
    Unknown      6                  0                    5
¹ %
df %>%
  select(sex, species) %>%
  drop_na() %>%
  tbl_summary(by = "species", 
              percent = "cell", 
              statistic = all_categorical() ~ "{p}%")
Characteristic   Adelie, N = 146¹   Chinstrap, N = 68¹   Gentoo, N = 119¹
sex
    female       22%                10%                  17%
    male         22%                10%                  18%
¹ %

Careful!

We were right to be careful: 11 observations had missing data for penguins’ sex!

Combine multiple data wrangling functions at once, to summarise data

df %>%
  select(body_mass_g, sex, species, bill_length_mm, bill_depth_mm, year) %>%
  mutate(body_mass_pounds = body_mass_g / 453.59237,
         sex = case_when(sex == "female" ~ "Femelle",
                         sex == "male"   ~ "Mâle")) %>%
  rename("Espèce" = "species",
         "企鵝喙的長度(單位:毫米)" = "bill_length_mm",
         "عمق المنقار (مم) )" = "bill_depth_mm") %>%
  filter(body_mass_pounds >= (mean(body_mass_pounds, na.rm = TRUE) + sd(body_mass_pounds, na.rm = TRUE))) %>%
  drop_na() %>%
  
  tbl_summary(by = "sex", 
              statistic = list(all_continuous() ~ "{mean} ({sd})"))
Characteristic              Femelle, N = 5¹   Mâle, N = 56¹
Espèce
    Gentoo                  5 (100%)          56 (100%)
island
    Biscoe                  5 (100%)          56 (100%)
企鵝喙的長度(單位:毫米)     46.16 (1.76)      49.59 (2.76)
عمق المنقار (مم)            14.44 (0.65)      15.76 (0.75)
flipper_length_mm           214 (5)           222 (6)
body_mass_g                 5,140 (65)        5,534 (276)
year
    2007                    1 (20%)           17 (30%)
    2008                    3 (60%)           19 (34%)
    2009                    1 (20%)           20 (36%)
body_mass_pounds            11.33 (0.14)      12.20 (0.61)
¹ n (%); Mean (SD)

6 Do we have time for linear models?

If we have the time, we’ll fit some linear models. If not, that will be a bonus for you at home

  • You might be familiar with good old \(Y_i = \alpha + \beta X_i + \varepsilon_i\), also called the regression by its friends. Let’s fit it, with one, two, and three predictors, with interactions, and have a look at the results

\[ \begin{align} \text{Model 1: } Y_i &= \alpha + \varepsilon_i \\ \text{Model 2: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \varepsilon_i \\ \text{Model 3: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \lambda \, X_{i,\text{bill depth}} + \varepsilon_i \\ \text{Model 4: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \lambda \, X_{i,\text{bill depth}} + \gamma \, (X_{i,\text{flipper length}} \times X_{i,\text{bill depth}}) + \varepsilon_i \end{align} \]

lm1 <- lm(formula = body_mass_g ~ 1, data = df)
lm2 <- lm(body_mass_g ~ 1 + flipper_length_mm, df)
lm3 <- lm(body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm, df)
lm4 <- lm(body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm + flipper_length_mm:bill_depth_mm, df)

Check the results using the base R summary() function

summary(lm1)

Call:
lm(formula = body_mass_g ~ 1, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1501.8  -651.8  -151.8   548.2  2098.2 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  4201.75      43.36   96.89 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 802 on 341 degrees of freedom
  (2 observations deleted due to missingness)

Check the results using the base R summary() function

summary(lm2)

Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       -5780.831    305.815  -18.90 <0.0000000000000002 ***
flipper_length_mm    49.686      1.518   32.72 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 0.00000000000000022

Check the results using the base R summary() function

summary(lm3)

Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm, 
    data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1029.78  -271.45   -23.58   245.15  1275.97 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       -6541.907    540.751 -12.098 <0.0000000000000002 ***
flipper_length_mm    51.541      1.865  27.635 <0.0000000000000002 ***
bill_depth_mm        22.634     13.280   1.704              0.0892 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.2 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.761, Adjusted R-squared:  0.7596 
F-statistic: 539.8 on 2 and 339 DF,  p-value: < 0.00000000000000022

Check the results using the base R summary() function

summary(lm4)

Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm + 
    flipper_length_mm:bill_depth_mm, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-938.88 -253.96  -28.25  220.66 1048.33 

Coefficients:
                                  Estimate Std. Error t value
(Intercept)                     -36097.064   4636.271  -7.786
flipper_length_mm                  196.074     22.603   8.675
bill_depth_mm                     1771.796    273.003   6.490
flipper_length_mm:bill_depth_mm     -8.596      1.340  -6.414
                                            Pr(>|t|)    
(Intercept)                       0.0000000000000855 ***
flipper_length_mm               < 0.0000000000000002 ***
bill_depth_mm                     0.0000000003057955 ***
flipper_length_mm:bill_depth_mm   0.0000000004782702 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 371.8 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.787, Adjusted R-squared:  0.7851 
F-statistic: 416.2 on 3 and 338 DF,  p-value: < 0.00000000000000022
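Since Models 1 to 4 are nested, they can also be compared directly with F-tests using base R’s `anova()`. A sketch on the built-in `mtcars` data (the slides’ `lm1` to `lm4` objects would work exactly the same way):

```r
# Four nested models of fuel consumption, from intercept-only to interaction
m1 <- lm(mpg ~ 1, data = mtcars)
m2 <- lm(mpg ~ 1 + wt, data = mtcars)
m3 <- lm(mpg ~ 1 + wt + hp, data = mtcars)
m4 <- lm(mpg ~ 1 + wt + hp + wt:hp, data = mtcars)

# Each row tests whether the added term significantly reduces the residual SS
anova(m1, m2, m3, m4)
```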

Check the results using the modelsummary package

library("modelsummary")

list_lm <- list(
  "Model 1" = lm1,
  "Model 2" = lm2,
  "Model 3" = lm3,
  "Model 4" = lm4)

modelsummary(list_lm,
             stars = TRUE,
             fmt = 2,
             vcov = "HC3",
             title = "Table 1: An example of a couple of different linear models.",
             notes = "Notes. Y = Body mass (in grams). Robust standard errors 'HC3'. Noice, right!?",
             coef_rename = c(
               "flipper_length_mm" = "Flipper length (mm)",
               "bill_depth_mm" = "Bill depth (mm)",
               "flipper_length_mm:bill_depth_mm" = "Flipper length (mm) X Bill depth (mm)"), 
             gof_map = c("nobs", "adj.r.squared"))

Check the results using the modelsummary package

Table 1: An example of a couple of different linear models.
                                          Model 1      Model 2      Model 3       Model 4
(Intercept)                            4201.75***  -5780.83***  -6541.91***  -36097.06***
                                          (43.43)     (294.88)     (500.35)     (4389.62)
Flipper length (mm)                                    49.69***     51.54***     196.07***
                                                        (1.45)       (1.75)       (21.46)
Bill depth (mm)                                                      22.63+      1771.80***
                                                                    (12.18)      (259.54)
Flipper length (mm) X Bill depth (mm)                                              -8.60***
                                                                                   (1.28)
Num.Obs.                                      342          342          342           342
R2 Adj.                                     0.000        0.758        0.760         0.785
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Notes. Y = Body mass (in grams). Robust standard errors 'HC3'. Noice, right!?
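With an interaction term, the coefficient on flipper length no longer has a single interpretation: its effect depends on bill depth. One way to make sense of it is to compute predictions over a grid of predictor values with base R’s `predict()`. A sketch on `mtcars` (the same approach applies to `lm4`):

```r
# An interaction model: the effect of wt on mpg depends on hp
m <- lm(mpg ~ wt * hp, data = mtcars)

# Predict over a small grid of predictor values
grid <- expand.grid(wt = c(2, 3, 4), hp = c(100, 200))
cbind(grid, predicted_mpg = predict(m, newdata = grid))
```

The marginaleffects package (see the references) automates this kind of computation.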

7 Advice before we leave 😭

From Peter Peregrine’s blog

Functions and packages you’ll someday need

R is highly logical: think, ask, look for answers on forums!

  • What should you do if your code isn’t working?
  • What to do when you do not know how to do what you are aiming to do?
    • R is a highly logical language. The first thing to do is ask yourself: (1) Where am I? (2) Where do I want to go, very precisely? (3) What are the logical steps to get from (1) to (2)? Doing so breaks the problem down into small bits, which are much more manageable
    • Finally, the vast majority of your questions have already been answered online. Try Stack Overflow, which is dedicated to programming and has a section specifically for R. You can also try Cross Validated, which is dedicated to statistics. Look for answers before asking a new question!

Example of a Cheat sheet, for the dplyr package 1/2

Example of a Cheat sheet, for the dplyr package 2/2

8 (Free) online resources

9 References

Arel-Bundock, Vincent, Noah Greifer, and Andrew Heiss. 2024. “How to Interpret Statistical Models Using marginaleffects for R and Python.” Journal of Statistical Software 111 (9): 1–32. https://doi.org/10.18637/jss.v111.i09.
Barnier, Julien. 2024. Introduction à R et au tidyverse.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bergé, Laurent. 2018. “Efficient Estimation of Maximum Likelihood Models with Multiple Fixed-Effects: The R Package FENmlm.” CREA Discussion Papers, no. 13.
Croissant, Yves, and Giovanni Millo. 2008. “Panel Data Econometrics in R: The plm Package.” Journal of Statistical Software 27 (2): 1–43. https://doi.org/10.18637/jss.v027.i02.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Kuznetsova, Alexandra, Per B. Brockhoff, and Rune H. B. Christensen. 2017. “lmerTest Package: Tests in Linear Mixed Effects Models.” Journal of Statistical Software 82 (13): 1–26. https://doi.org/10.18637/jss.v082.i13.
Larmarange, Joseph. 2024. Introduction à l’analyse d’enquêtes avec R et RStudio.
Lüdecke, Daniel, Mattan S Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60).
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.
Offer-Westort, Molly, Leah R Rosenzweig, and Susan Athey. 2024. “Battling the Coronavirus ‘Infodemic’ Among Social Media Users in Kenya and Nigeria.” Nature Human Behaviour 8 (5): 823–34.
Rohrer, Julia M, and Vincent Arel-Bundock. 2025. “Models as Prediction Machines: How to Convert Confusing Coefficients into Clear Quantities.” https://doi.org/10.31234/osf.io/g4s2a_v1.
Rosseel, Yves. 2012. “lavaan: An R Package for Structural Equation Modeling.” Journal of Statistical Software 48 (2): 1–36. https://doi.org/10.18637/jss.v048.i02.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59: 1–23.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Mine Cetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.